Object oriented data analysis (OODA) aims at statistically
analyzing populations of complicated objects. This paper is motivated
by a study of cell images in cell culture biology, which highlights a 
common critical issue: choice of data objects. Instead of convention- 
ally treating either the individual cells or the wells (a container in 
which the cells are grown) as data objects, a new type of data object 
is proposed, that is the union of a well with its corresponding set of 
cells. This paper contains two parts. The first part is the image data 
analysis, which suggests empirically that the cell-well unions can be 
a better choice of data objects than the cells or the wells alone. The 
second part discusses the benefit of choosing cell-well unions as data 
objects using an illustrative example and simulations. This research 
suggests that OODA is not simply a framework for understanding
the structure of the data analysis. It leads to useful interdisciplinary 
discussion that gives better results through more appropriate choice 
of data objects, especially for complex data analyses. 

1. Introduction. The concept of Object Oriented Data Analysis (OODA) 
was introduced by Wang and Marron (2007) [17]. The data objects are un- 
derstood as the atoms of the statistical analysis. They could be numbers 
as taught in an elementary statistical course or vectors as in multivariate 
analysis. OODA, however, facilitates the analysis of populations of complex 
data objects. An interesting special case is functional data analysis, where 
the data objects are curves. See Ramsay and Silverman (2005) [12] for an 
overview of this type of analysis. Dryden and Mardia (1998) [4] studied ge- 
ometrical properties of objects, where the data objects are shapes. Wang 
and Marron (2007) [17], Aydin et al (2009) [2] and Shen et al (2013) [15] 
analyzed tree-structured data from medical images, where the data objects 
are trees. 

Note that the concept of data objects generalizes the classical notion of
experimental units. An experimental unit is typically considered as
one of a set of physical entities, each subjected to different experimental
treatments, for instance a well (i.e. a container for growing cells) with
certain growth factors. On the other hand, OODA allows much more complex
and abstract objects, such as images, shapes, trees, or even covariance
matrices. The goal of OODA is to fully understand the data structure, choose
appropriate data objects, and finally come up with an appropriate analysis 
oriented by this choice of data objects. For example, in tree structured data 
analyses, combinatorial trees can be chosen as data objects to study tree 
structures. In order to study the evolutionary relations among a group of 
organisms, phylogenetic trees are a good choice of data objects. See Holmes 
(1999 [5], 2003a [6], 2003b [7]) and Li et al (2000) [9]. To exploit the power 
of functional data analysis to analyze data in tree space, the Dyck path rep- 
resentations are a good choice of data objects. See Shen (2012) [14]. Note 
that OODA is about how to approach complex data analysis settings and is 
not limited to any particular data analysis methods. For example, nonpara- 
metric regression analysis of 3-d images as data objects was done in Davis 
(2008) [3] and of artery trees as data objects by Wang et al (2012) [18]. 

This paper discusses cell image analysis in cell culture biology from an 
object-oriented point of view. The motivation of this research is to develop 
a statistical approach to cell image analysis that better supports the au- 
tomated development of stem cell growth media. A major hurdle in this 
process is the need for human expertise, based on studying cells under the 
microscope, to decide when to passage the cells to new media. We aim to 
use digital imaging technology coupled with statistical analysis to tackle this 
important problem (see Section 1.1). A new type of data object is proposed 
in Section 2. Comparison with other natural choices of data objects shows 
the benefit of this choice. Section 2.3 describes the final results of the image 
data analysis based on the choice of the proposed data object. Section 3 
further discusses the advantages of the proposed data object using an il- 
lustrative example and simulations, which can be easily generalized to any 
data set with a structure of groups and corresponding individuals. 

It is seen that OODA is not only a framework for describing data
objects, but also provides efficient terminology for making critical choices at
the beginning of a complicated data analysis, especially in inter-disciplinary 
situations. In the example of this cell image analysis, biologists are comfort- 
able with the notions of cell and well, but do not have simple terminology 
for the union. The discussion of "what should we take as data objects?"
allows all parties involved to arrive quickly at, and easily understand the
benefits of, the cell-well union as the best choice. Another excellent
example of the benefits of OODA for facilitating inter-disciplinary discussion
is in statistical acoustics research. See e.g. Aston et al (2012) [1], where the 
raw data are digitally recorded sounds of human speech. The data objects 
could be just the time series of sounds, but that might needlessly obscure 
key aspects of speech. The data objects could also be any of various types
of frequency analysis. In the end, motivated by careful discussion of invari- 
ance principles, that interdisciplinary group finally chose a particular type 
of covariance matrices as data objects. 

1.1. Background of Cell Images. The maintenance and growth of cells 
under controlled conditions is called cell culture. In vitro culture of cells 
taken directly from human tissues such as stem cells is, however, very dif- 
ficult. Success depends on having the right conditions for growth, which 
include the type of container, the surface coating, oxygen levels, nutrients, 
and cell-signalling molecules. The liquid containing the nutrients and cell 
signalling molecules is generally called the growth medium. Two different 
growth media, having different components, may result in very different 
outcomes. There is great medical and commercial value in developing opti- 
mal growth media for stem cells, so the development of growth media is an 
important problem in the biotechnology industry. Furthermore, the use of 
automated methods to develop new media can greatly reduce development 
costs and increase the likelihood of success. 

In order to produce enough cells for a medical procedure, cells are grown
through several passages (or procedures). At each passage, cells are har- 
vested and then reseeded into new vessels at a lower density, due to exten- 
sive cell-cell signalling as a function of density. Beyond a certain level of 
density, undesirable differences in morphology and phenotype arise (e.g. the 
cells are dying). So one of the most important problems in cell culture is 
deciding when to passage the cells. Cell density in a container is typically 
described in terms of confluence. The confluence of a cell culture is the per- 
centage of the surface of the container that is covered by cells. For example, 
a 100% confluent culture has cells in all surface area available for cell growth, 
whereas a 50% confluent culture has used half of the available area. Usually 
it is desirable to passage a cell culture before it reaches 100% confluence. In 
particular, stem cell cultures are often passaged at 80% confluence. 
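
The definition of confluence above can be made concrete with a small sketch. It assumes, purely for illustration, that the growth surface is available as a binary mask in which covered pixels are marked 1; the function names `confluence` and `ready_to_passage` and the default 80% threshold argument are hypothetical and are not part of the analysis reported in this paper.

```python
def confluence(mask):
    """Percentage of the growth surface covered by cells.

    `mask` is a list of rows; an entry is truthy where the pixel is
    covered by a cell (an assumed input format).
    """
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    covered = sum(1 for row in mask for pixel in row if pixel)
    return 100.0 * covered / total


def ready_to_passage(mask, threshold=80.0):
    """True once confluence reaches the chosen passaging threshold."""
    return confluence(mask) >= threshold


# Half of the available area is used, so the culture is 50% confluent.
half_covered = [[1, 1, 0, 0],
                [1, 1, 0, 0]]
print(confluence(half_covered))        # 50.0
print(ready_to_passage(half_covered))  # False
```

In practice such a mask is not directly observable in bright field images, which is why the remainder of this section turns to indirect image features.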

Scientists often study images of the cells growing in the container to
estimate the confluence and to decide whether or not to passage the cells. From
a subjective viewpoint this is done by viewing the image and estimating
the remaining space available for cell growth. This process is slow, manual, 
and highly variable, so being able to estimate the confluence directly from 
the image is an important capability of any automated cell growth platform. 
This estimation could be done, for example, by counting the number of cells, 
multiplying the number of cells by the average cell size and then comparing 
that area to the total surface area of the container. This approach is not 
generally desirable in an automated system because most methods to get
this information kill the cells. A non-destructive way to get this information 
is through bright field imaging, where one shines a bright light down through 
the top of the container and records the image of the shadows from below. 
See Figure 1 for examples of bright field images. 

However, to determine cell confluence level based on the shadows in a 
bright field image is difficult. One can hardly tell the cell number in the image 
explicitly. But some other visual factors in the image can help biologists 
make their assessment of confluence level, such as the shape of the cells 
(more accurately, the cell shadows), the amount of empty space for the cells 
to grow into and the cell path (the patterns in how the cells orient with 
respect to each other). Changes of these visual factors as the confluence 
level increases can be seen in Figure 1, where the three images are ordered 
from least confluent to most confluent. This manual assessment by biologists 
is usually subjective. Thus, it is proposed to develop a statistical approach to 
numerically summarize these visual features from an image and then make 
an objective statistical evaluation of cell confluence level. 

Fig 1. Pre-processed and intensity-normalized bright field images of three different wells
from a 96-well plate of adherent stem cells, sorted from low confluence level to high
confluence level. The well names are in the upper left corner. The cells correspond to the long
thin objects. From left to right, the cell number increases, the cell shape changes, the gap
between cells gets smaller and the cells begin to orient with respect to each other.



A single 96-well plate of adherent stem cells from a screening experiment 
by BD Technologies is selected as the training sample. Each well is essen- 
tially a container in which cells are grown under a controlled condition. 
The culture conditions of the inner 60 wells represent a variety of culture 
conditions that support different rates of cell growth, leading to different 
confluence levels. The passaging decisions will be made on the well level, 
i.e. the cells in the same well will be passaged together. A bright field im- 
age is taken for each of the inner 60 wells (Figure 1). The boundary of 
each cell is identified, that is, the cell is segmented, using a custom script 
developed at BD Technologies with IPLab for Pathway software
(http://www.digitalimagingsystems.co.uk/software/iplab.html). Figure 2
shows the corresponding cell segmentation of the three bright field images
in Figure 1. Pixels that are identified to be interior to cells are colored cyan. 
The identified objects do not exactly cover the real cells, but this gives a 
useful approximation. 




Fig 2. Cell identification (using IPLab imaging software) of the wells shown in Figure 1.
The well names are on the upper left corner. The cyan objects are the identified cells. 



Since the confluence level cannot be directly and unambiguously determined in a
bright field image, in order to get a confluence evaluation of the 60 wells, an
experiment was designed where four biologists were asked to assess the confluence
level of the 60 images. Figure 3 shows the workflow of this experiment. The images
were initially ordered by well name, a random order of confluence level, as the
condition of each well was chosen under a randomized design. At first the biologists
participated in the experiment individually. Each of them sorted the images in order
based on their own estimated confluence level $\gamma$, and then specified two
thresholds $\alpha_1$ and $\alpha_2$ ($\alpha_1 > \alpha_2$) for making a passaging
decision for every image: to passage if $\gamma > \alpha_1$, not clear if
$\alpha_2 < \gamma < \alpha_1$, and not to passage if $\gamma < \alpha_2$. However,
the evaluation results varied among biologists due to different subjective
perceptions of confluence.

Fig 3. Workflow of the manual assessment of confluence level. The images were
originally ordered by name (B02, B03, ...), and then sorted in order of the estimated
confluence by biologists. Finally the passaging decisions were made based on the
estimated confluence level: to passage (high level), not clear (medium level) and not
to passage (low level).

After a careful discussion, the biologists finally reached a consensus assess- 
ment, referred to later as bio-assessment, which will be considered as an 
unbiased evaluation of confluence level to judge the performance of the
statistical approach developed later in Section 2. This assessment resulted in
each image receiving an integer indicating the bio-rank (the rank of
confluence level), and a categorical variable indicating the bio-class: low confluence
level, medium confluence level or high confluence level. Each bio-class cor- 
responds to a passaging decision: not to passage, not clear, or to passage. 
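
The individual passaging rule used by each biologist can be written down directly. The following is a minimal sketch, assuming an estimated confluence level `gamma` and two thresholds `alpha1 > alpha2` on the same scale; the function name, the string labels and the treatment of values exactly equal to a threshold (assigned here to "not clear") are assumptions made for illustration only.

```python
# Correspondence between bio-class and passaging decision, as stated in the text.
BIO_CLASS_TO_DECISION = {
    "low": "not to passage",
    "medium": "not clear",
    "high": "to passage",
}


def passaging_decision(gamma, alpha1, alpha2):
    """Three-way passaging decision from an estimated confluence level.

    gamma  : estimated confluence level of one image
    alpha1 : upper threshold (passage above it)
    alpha2 : lower threshold (do not passage below it)
    """
    if not alpha1 > alpha2:
        raise ValueError("alpha1 must be larger than alpha2")
    if gamma > alpha1:
        return BIO_CLASS_TO_DECISION["high"]
    if gamma < alpha2:
        return BIO_CLASS_TO_DECISION["low"]
    # Everything between the thresholds (boundaries included, by assumption).
    return BIO_CLASS_TO_DECISION["medium"]


print(passaging_decision(0.9, alpha1=0.8, alpha2=0.6))  # to passage
print(passaging_decision(0.7, alpha1=0.8, alpha2=0.6))  # not clear
print(passaging_decision(0.3, alpha1=0.8, alpha2=0.6))  # not to passage
```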

The goal of this research is to develop an objective and consistent ap- 
proach for assessing confluence level via statistical analysis of bright field 
images in order to better support manual passaging decisions as well as pro- 
vide the foundation for an automated passaging system. The conventional 
approach is cell number assessment (i.e. assessing cell confluence level merely 
by counting the total number of the identified cells, ignoring other image 
features), which does not match the bio-assessment very well. It is shown 
later in Section 2.3 that the alternative statistical approach proposed in this 
paper substantially improves on the cell number assessment, in the sense of
better predicting the bio-assessment.

1.2. Feature Extraction. The bright field images are carefully pre-processed 
before extracting image features. Some standard graphical techniques [16], 
such as flat field correction and convolution filter, are used to remove un- 
even background shading and granular noise. The intensity is normalized 
across images. Two types of confluence-related features are extracted from 
the images: 

(1) Cell features, including properties of an individual cell and its rela- 
tionship with its neighbors. These features can be categorized into 
four categories, namely intensity, shape & size, local density, and cell
orientation (cell path), listed in Table 1.

(2) Entire-well features. Since cell confluence level is a function of the 
entire well instead of a simple collection of cells, some additional well- 
level, or image-level, features are also considered in evaluating conflu- 
ence level. These well-level features, such as the cell number and some 
summaries of the gap^1 in the image, are summarized in Table 2.

Due to irregular intensity distribution and irregular cell features respectively, 
two images are flagged as outliers. 
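
The circular-gap features among the entire-well features (Table 2) are described as summaries of a distance transformation of the segmented image. The sketch below illustrates the idea only: it assumes a binary cell mask stored as nested lists and uses a classical two-pass city-block distance transform, whereas the metric and software actually used for the reported analysis are not specified here. The names `cityblock_distance_transform` and `gap_size_summaries` are hypothetical.

```python
import statistics


def cityblock_distance_transform(cell_mask):
    """Distance from every gap pixel to the nearest cell pixel.

    Uses the two-pass city-block (4-neighbour) algorithm. Cell pixels
    (truthy entries) get distance 0.
    """
    rows, cols = len(cell_mask), len(cell_mask[0])
    far = rows + cols  # exceeds any city-block distance inside the image
    dist = [[0 if cell_mask[r][c] else far for c in range(cols)]
            for r in range(rows)]
    # Forward pass: propagate distances from the top and the left.
    for r in range(rows):
        for c in range(cols):
            if r > 0:
                dist[r][c] = min(dist[r][c], dist[r - 1][c] + 1)
            if c > 0:
                dist[r][c] = min(dist[r][c], dist[r][c - 1] + 1)
    # Backward pass: propagate distances from the bottom and the right.
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            if r < rows - 1:
                dist[r][c] = min(dist[r][c], dist[r + 1][c] + 1)
            if c < cols - 1:
                dist[r][c] = min(dist[r][c], dist[r][c + 1] + 1)
    return dist


def gap_size_summaries(dist):
    """Six summaries of the distance image over gap pixels only."""
    gaps = [d for row in dist for d in row if d > 0]
    q25, q50, q75 = statistics.quantiles(gaps, n=4, method="inclusive")
    return {"std": statistics.stdev(gaps), "min": min(gaps), "max": max(gaps),
            "q25": q25, "q50": q50, "q75": q75}


# Cells along the left and right edges, a gap three pixels wide in between.
mask = [[1, 0, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [1, 0, 0, 0, 1]]
distance_image = cityblock_distance_transform(mask)
print(distance_image[1])                  # [0, 1, 2, 1, 0]
print(gap_size_summaries(distance_image)["max"])  # 2
```

Large values in the distance image mark the centres of wide gaps, so upper quantiles and the maximum shrink as the culture approaches confluence.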

^1 The gap refers to the non-cyan area in IPLab segmented images (Figure 2). This gives
an indication of how much more space the cells have to expand (generally, a smaller gap
indicates a higher confluence level). Also, as IPLab identification of cells (the cyan objects)
cannot exactly cover the real cells, the gap contains part of the cell information.

Table 1
Summary of cell features.

     Categories      Details                                                   # of Fea.
  1  Intensity       Average, Std., Average log10, Minimum, Maximum,           8
                     the 25%, 50%, 75% quantiles of cell pixel intensity
  2  Shape & Size    Perimeter, Area, Non-convexity, Length-Width Ratio,       5
                     Radius Std.
  3  Local Density   Cell densities in 5 square moving windows with            5
                     different sizes
  4  Cell Orient.    Cell angle, Angle difference with nearest neighbors,      14
                     the 25%, 50%, 75% quantiles of angle differences in 4
                     square moving windows with different sizes

Table 2
Summary of additional entire-well features.

     Categories      Details                                                   # of Fea.
  1  Cell Number     Number of identified cells in an image                    1
  2  Cell Gap        Summaries* of gap intensity                               6
                     Summaries* of the size of circular gaps**                 6

*  Standard deviation, min., max. and the 25%, 50%, 75% quantiles are used as summaries.
** These features are extracted by performing distance transformation [13] on the IPLab
   segmented image. Statistical summaries of the intensity of the resulting distance image
   are used as a description of the size of the circular gaps among cells.

2. Object Oriented Data Analysis of Image Data. An important
theme of OODA is that the very definition of data objects should be carefully
considered before data analysis. In this cell image analysis, different
choices of data objects are available and lead to different results. Since cell
confluence level reflects the amount of available space capacity of a well and
the passaging decisions are made at the well level, it is natural to treat wells
as the data objects. Meanwhile, as the cell features (Table 1) play an
essential part in determining confluence level, the individual cells should be
considered as another important aspect of the atoms of the analysis. Note
that one could treat either the cells or wells alone as data objects. Section 2.2
shows a benefit from analyzing both the wells and the cells together, which
motivates consideration of a new type of data object, that is, the union of a
well with its corresponding set of cells, or the cell-well union. Section 2.1
describes how the choice of data objects orients further analyses.

From an object-oriented point of view, the image data analysis is done 
in two steps: (1) Separate analyses for various choices of data objects
(Section 2.2), which show the advantage of treating the cell-well unions as data
objects; (2) Analysis of cell-well unions as data objects (Section 2.3), which 
provides the final results of our statistical assessment. See Section 3 for 
further discussions of the choices of data objects. 

2.1. Data Objects and the Consequential Analyses. As discussed in
Section 1.2, two different data sets are included in our analyses:

• Cell data (containing cell features of each individual cell); 

• Well data (containing entire-well features of each well).

The cell sample size is always dramatically larger than the well sample size. 
The first challenge in analyzing cell-well structured data is how to combine 
these two data sets. One natural solution is to define statistics to summarize 
the cell features across wells, and then combine the summarized cell data 
with the well data. Finally, the statistical passaging decision for each well 
will be made based on the combined data set. 

The following describes how the procedure of analysis will be oriented 
by the choice of data objects. Three different types of data objects and the 
corresponding data analyses are discussed. 

(1) Cells-alone analysis, i.e. analysis based on cells alone as data objects. 
In this analysis, only the cell data are used. The bio-assessment of a 
cell is defined the same as the bio-assessment of the well where the 
cell is cultured. The statistical passaging decision for a well is made 
based on the average of the predicted bio-classes of the individual cells 
in that well. Since all the additional entire-well features are ignored, 
one can expect that this analysis would not give a good classification 
of the passaging groups, i.e. the cells alone would not be an optimal 
choice of data objects. 

(2) Wells-alone analysis, i.e. analysis based on wells alone as data objects. 
Both the cell data and the well data are used. However, since cells are 
not chosen as data objects, no cell data analysis is done here. The basic 
idea of this wells-alone analysis is to first summarize the cell features 
across wells directly by statistics, such as quantiles, and then combine 
the summarized cell data with the well data. Finally, the statistical 
passaging decisions are made by analyzing the combined data set. 

(3) Cell-well union analysis, i.e. analysis based on cell-well unions as data 
objects. This analysis uses both the cell data and the well data. First, 
the cell data analysis finds an appropriate way to summarize the cell 
data across wells. In particular, it finds a linear combination of the cell 
features that correlates well with the bio-rank, and then takes statis- 
tics, such as quantiles, of this linear combination and its orthogonal 
PC scores across wells as the summarized cell data. Finally, the sta- 
tistical passaging decisions are made based on the combined data set 
of the summarized cell data and the well data. 
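
For the cells-alone analysis in item (1), the step from cell-level predictions to a well-level decision can be sketched as follows. The sketch assumes that the three bio-classes are coded 0, 1, 2 and that the average predicted label of a well is rounded to the nearest class; both the coding and the rounding rule are assumptions made for illustration, and the function name `well_decisions` is hypothetical.

```python
from collections import defaultdict

# Assumed coding of the three passaging groups.
CLASSES = ["not to passage", "not clear", "to passage"]


def well_decisions(cell_wells, cell_predictions):
    """Average the predicted class codes of the cells in each well.

    cell_wells       : well name of every cell, e.g. "B02"
    cell_predictions : predicted class code (0, 1 or 2) of every cell
    """
    labels_by_well = defaultdict(list)
    for well, label in zip(cell_wells, cell_predictions):
        labels_by_well[well].append(label)
    decisions = {}
    for well, labels in labels_by_well.items():
        mean_label = sum(labels) / len(labels)
        # Round the average to the nearest class code (halves round up).
        decisions[well] = CLASSES[min(2, int(mean_label + 0.5))]
    return decisions


wells = ["B02", "B02", "B02", "B03", "B03", "B03", "B03"]
predicted = [0, 0, 1, 2, 2, 1, 2]
print(well_decisions(wells, predicted))
# {'B02': 'not to passage', 'B03': 'to passage'}
```

During training, every cell simply inherits the bio-assessment of its well, so no entire-well feature enters this analysis at any point.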

Fig 4. Workflows of three different analyses, oriented by different choices of data objects
respectively. Summaries refer to statistics, such as quantiles, of the cell-level features.



The procedures of these three different analyses, oriented by the choice of
data objects, are illustrated in Figure 4. Comparing the cell-well union
analysis with the cells-alone analysis, it is seen that both of them begin with an
analysis of the cell data. However, the cells-alone analysis makes passaging 
decisions simply based on the cell data analysis, while the cell-well union 
analysis includes an additional well-level analysis and makes passaging de- 
cisions based on the combined data set of the summarized cell data and 
the additional well data. It is also seen that the key difference between the 
cell-well union analysis and the wells-alone analysis is how the cell data are
summarized across wells. The former incorporates an additional cell data
analysis into the cell summarization. Further discussion of the choice of
data objects in Section 3 concludes that the cell-well unions are a better
choice of data objects than the other two choices. 

2.2. Comparison of Different Data Objects. This section aims at com- 
paring the choices of three different data objects, the cells alone, the wells 
alone, and the cell-well unions, by performing the three corresponding
analyses on the cell image data separately. For the purpose of comparison, we used
the same statistical method, Distance Weighted Discrimination (DWD), to
make the final passaging decisions in all these three analyses. Proposed by 
Marron et al (2007) [10], DWD is a powerful classification tool, especially for 
high dimensional cases. It was used here to find the best linear separations 
between pairs of the three passaging groups and then to predict the group 
labels as the predicted passaging decisions. The consensus bio-classification, 
described in Section 1.1, will be considered as a gold standard to judge the 
performance of these analyses. 

(1) Cells-alone analysis. Figure 5 (left) visualizes the cell data in two di- 
mensions using Principal Component Analysis (PCA). The point color 
and the symbol are determined by the bio-assessment. The unclear
pattern of either the colors or the symbols suggests that the confluence
information contained in the cell data is not obvious. We intended to
use DWD to classify the cell data directly. Each cell in a well would 
receive a label indicating its predicted passaging group, and the pas- 
saging decision for this well would be predicted by the average label 
of the cells within this well. However, due to the large sample size of 
the cells (over 20,000), we encountered computational difficulties us- 
ing the current DWD R package by H. Huang et al (2011) [8]. As an 
alternative approach, we randomly sampled the wells and randomly 
sampled a small set of cells from each well, and then used DWD to 
classify this smaller data set. This procedure was repeated 500 times, 
and the average classification error rate was 25.1%. 

(2) Wells-alone analysis. Each cell feature was summarized into well-level 
features directly using 6 statistics: maximum, minimum, median, the 
25% and 75% quantiles and standard deviation. The dimension of the 
summarized cell data is 6 times the original dimension. Then DWD 
was performed on the combined data set of the summarized cell data 
and the well data. The classification error rate was 8.6%. 

(3) Cell-well union analysis. PCA was used to analyze the cell data, finding 
orthogonal directions that account for as much of the cell variability 
as possible. In Figure 5, the left plot shows a scatter plot of the first 
two PC scores, and the right shows only the averages across wells. It 
is seen that the vertical locations of the points reflect the order of the
colors and the symbols, that is, PC1 reveals the bio-assessment. As a
result of this PCA, each cell had a total of 32 PC scores. The same
6-number summaries used in the wells-alone analysis were also used here
to summarize the collections of scores across wells. The summarized 
PC scores, considered as the summarized cell data, were then combined 
with the well data. The DWD classification of this combined data set 
gives an error rate of 5.2%. 
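
The 6-number summarization shared by analyses (2) and (3) can be sketched as below. The input is assumed to be a mapping from well name to a list of per-cell vectors (original features for the wells-alone analysis, PC scores for the cell-well union analysis). The quantile interpolation rule and the use of the sample standard deviation are assumptions, since the conventions actually used are not stated; `six_number_summary` and `summarize_wells` are hypothetical names.

```python
import statistics


def six_number_summary(values):
    """Maximum, minimum, median, 25% and 75% quantiles, standard deviation."""
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return [max(values), min(values), q50, q25, q75, statistics.stdev(values)]


def summarize_wells(cells_by_well):
    """Turn per-cell vectors of dimension d into one 6*d vector per well."""
    summarized = {}
    for well, cells in cells_by_well.items():
        well_vector = []
        for column in zip(*cells):  # one column per cell-level variable
            well_vector.extend(six_number_summary(column))
        summarized[well] = well_vector
    return summarized


# One well with three cells and two cell-level variables.
toy = {"B02": [[1, 10], [2, 20], [3, 30]]}
summary = summarize_wells(toy)
print(len(summary["B02"]))  # 12
print(summary["B02"][:6])   # [3, 1, 2.0, 1.5, 2.5, 1.0]
```

With 32 cell-level variables this yields 6 x 32 = 192 summarized variables per well, which are then concatenated with the entire-well features before classification.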

Compared with the cell-well union analysis, the cells-alone analysis has 
obvious disadvantages, as it ignores all the information in the well data and 
may also create computational challenges due to the large sample size of 
cells. In this image data study, the cell-well union analysis gives the lowest 
DWD classification error rate, and thus provides a set of statistical passaging 
decisions that is the most consistent with the bio-classification by biologists. 
A leave-one-well-out cross-validation shows the error rate of DWD
classification using cell-well union analysis is 23%, and wells-alone analysis 24%. This
very slight advantage of the cell-well unions motivates a deeper look at this
comparison. Questions, such as how the cell-well unions gain an advantage
over the wells alone, and whether the benefits from cell-well unions depend
on statistical tools or the data structure, are discussed in Section 3.

Fig 5. PCA of the cell data. Left: Scatter plot of PC1 scores vs. PC2 scores. Right: Same
plot as the left, only showing the averages across wells. The points are colored from green
(most confluent) to blue and then red (least confluent) according to the bio-rank. The
symbols represent the bio-classification: to passage (cross), not clear (triangle) and not to
passage (circle). PC1 conveys a lot of information about cell confluence.

2.3. Analysis of Cell-Well Data Objects. Section 2.2 suggests taking the 
cell-well unions as data objects and also describes the main procedure of the 
corresponding cell-well union analysis. This section provides the final results 
of the image data analysis as well as some supplementary details. 

In the cell data PCA, the first four PCs together explain nearly 70% of the
cell data variability. Each of them reflects one of the four cell feature
categories listed in Table 1. In particular, PC1 is mainly about cell orientation,
PC2 about cell intensity, PC3 about cell shape and size, and PC4 about
local density. It is seen in Figure 5 that the PC1 score correlates most strongly with
the bio-assessment. Although most of the PCs do not correlate well with
the bio-rank, we summarized all of the 32 PC scores across wells for further 
well-level analysis, in order to keep as much cell information as possible. 
Experience suggests that dimension reduction may increase the error rate of 
predicting passaging groups and should be avoided. 

Finally, the percentage of false passaging decisions based on this cell-well 
union analysis, 5.2%, is much lower than that from the DWD classification 
based on the cell number alone, 25.9% (A leave-one-well-out cross-validation 
shows that the former error rate is 23%, while the latter is 30%). This result 
suggests that the statistical assessment based on image features can greatly 
improve the conventional cell number assessment, and thus can better sup- 
port the automated passaging system. 
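
The leave-one-well-out cross-validation quoted above follows a simple pattern, sketched here under stated assumptions. DWD itself is not reimplemented: a nearest-centroid classifier is swapped in as a stand-in, so the sketch shows only the resampling scheme and would not reproduce the reported error rates. All function names are hypothetical.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (not DWD): one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def nearest_centroid_predict(centroids, x):
    """Label of the class whose centroid is closest to x."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def leave_one_well_out_error(X, y):
    """Hold out each well in turn, train on the rest, count misclassifications."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = nearest_centroid_fit(train_X, train_y)
        errors += nearest_centroid_predict(model, X[i]) != y[i]
    return errors / len(X)


# Six toy wells described by a single well-level variable.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = ["low", "low", "low", "high", "high", "high"]
print(leave_one_well_out_error(X, y))  # 0.0
```

One design consideration: any data-driven cell-level step, such as the PCA used for summarization, would ideally be refitted inside each fold so that the held-out well does not influence its own summaries. Whether this was done for the reported figures is not stated here.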

3. Illustrative Example and Simulations. This section aims at ex- 
ploring the potential generality of the superiority of cell-well union data 
objects using a toy example and simulations. Since Section 2 shows
that the cell-well union analysis has obvious advantages over the cells-alone
analysis, this section will focus on comparing the cell-well union analysis
with the wells-alone analysis. 

Figure 4 shows that, in both the cell-well union analysis and the
wells-alone analysis, one important step is to summarize the cell data across wells.
The essential difference between these two analyses is whether to summarize 
based on a cell data analysis or not. After cell summarization, there is no 
difference between the workflows of these two analyses. Hence, the following 
discussions will focus on the cell data analysis and cell summarization. Sec- 
tion 3.1 shows how the cell summarization can dramatically affect the result 
of the analysis using a two-dimensional toy example. Section 3.2 extends the 
toy example into more general cases, and concludes that the cell-well union 
summaries are generally better than the wells-alone summaries. Section 3.3 
uses simulations to support the conclusion. 

In order to focus on the comparison of data objects, the following dis- 
cussions are independent of any particular statistical tools that are used to 
analyze either the cells or the wells. In fact, the study of data objects pro- 
vides suggestions of the choice of statistical tools as well as the choice of 
data objects. The basic idea is that, instead of comparing the final results 
from the cell-well union analysis and the wells-alone analysis, we compare 
the data patterns of the summarized cell data. Particularly, we assume that, 
if the summarized cell data in one analysis show a clearer pattern of
the bio-assessment, referred to later as bio-pattern, then, no matter what
statistical tools are used later to analyze this summarized cell data (or the
combined data set of this summarized cell data and well data), it is easier
to estimate the bio-assessment, and thus more probable to get a consistent
estimate of the bio-assessment.

3.1. Toy Example. This section aims at illustrating how different ways 
of cell summarization lead to different well-level patterns of the summarized 
cell data, using a two-dimensional toy example. Let the cell data be $(x_1, x_2)$.
For convenient visualization, we only consider one dimensional cell
summarization here, that is, each cell feature $x_i$ is summarized by a single statistic.
After summarization, the dimension of the summarized well-level data is the 
same as the original cell data. 

Recall that, in the image data analysis in Section 2, the cell-well union
analysis summarizes cell features based on their PC scores, while the
wells-alone analysis is based on feature-wise summaries. This difference can be
critical as highlighted by this toy example. Figure 6(A) illustrates the
difference between the PC summarization and the direct summarization in
a two-dimensional toy example based on only maxima as cell summaries.
The red ellipse represents the cell feature distribution of a single well. The
points $P_1$ and $P_2$ are the summarized well-level data from the two different
cell summaries respectively. As long as PC1, PC2 are different from $x_1$, $x_2$,
the corresponding summarized well-level data are different. The discussions
here can be easily generalized to the cases of using other statistics, such as
quantiles, standard deviation, etc. Note that taking the mean (the red point
$P_0$) as a summary is a different case, where the summarized data from either
the original cell features or their PCs are the same. 



Fig 6. Two-dimensional toy example based on maxima as summaries of the cells.
Each red ellipse represents the distribution of the cell features in a well.
Graph A: Two different summaries ($P_1$ and $P_2$) of the cell features of a
single well, respectively corresponding to the cell-well union analysis (based
on cell PCA, black) and the wells-alone analysis (based on the original cell
features $x_1$ and $x_2$, blue). Graph B: How the different cell
summarizations preserve or impair the underlying bio-pattern of cell data. The
red points (a, b, c, d, e) are the population means of the wells, arranged
along the true direction in order of bio-rank. The blue points (A, B, C, D, E)
are the cell summaries based on cell features $(x_1, x_2)$, which result in
poor estimates of the bio-rank. The black points (numbered 1, 2, 3, 4, 5) are
based on the cell-level PCs, which give much better bio-rank estimates.

For the purpose of estimating the bio-assessment, we study the approaches to
cell summarization for passing the bio-assessment information from cell data
to further well-level analyses, that is, how the underlying bio-pattern in the
cell data changes after cell summarization. Figure 6(B) shows how the cell
summarization can either impair or preserve the bio-pattern in cell data. Each
of the five red ellipses represents the distribution of the cell features of a
well. It is assumed that these wells have different bio-ranks, determined by
their mean cell features (red points, labeled a, b, c, d, e, which are unknown
in practice). The black arrow in the bottom right area shows the true
direction of the bio-rank (practically unknown), which happens to be the same
as the cell-level PC1. It is seen that the cell summaries based on
$(x_1, x_2)$, shown as the blue points (A, B, C, D, E), have a different order
from the red ones (a, b, c, d, e), which leads to inconsistent estimates of
the bio-rank. Thus the bio-pattern in the original cell data is impaired.



However, the summaries based on cell-level PCs, shown as the black points
(numbered 1, 2, 3, 4, 5), give consistent estimates (i.e. the black numbers
and the red lower case letters are in the same order). That is, the
bio-pattern is well preserved after this cell PC summarization. Hence the
cell-well union analysis is better than the wells-alone analysis. Note that
the cell feature distributions of the wells vary considerably in this example.
If those distributions are consistent, i.e. the shape and size of the red
ellipses are all similar, one can imagine that the corresponding blue points
(capital letters) and the black ones (numbers) will have the same order, i.e.
the two sets of cell summaries will give the same bio-rank estimates.

In conclusion, how well the bio-pattern in the original cell data is pre- 
served after cell summarization depends on both the summarizing method 
and the data structure. If the cell feature distributions vary across wells, 
then an additional cell-level analysis, such as PCA, can help construct a 
linear combination of the cell features that is close to the true direction and 
then better pass the bio-pattern in the cell data to further well-level studies 
by summarizing cells based on this linear combination. That is, the cell-well 
union analysis is better than the wells-alone analysis. On the other hand, if 
the cell feature distributions of the wells are consistent, then either cell sum- 
marization may perform equally well in capturing the bio-rank information 
from the cell data, that is, the cell-well union analysis and the wells-alone 
analysis may have equivalent performance. 
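
The toy example can also be checked numerically. The following Python sketch
is not the construction behind Figure 6. It assumes five Gaussian wells with
hypothetical means and spreads, uses the population 99% quantile as a stand-in
for the maximum, and takes the two PC directions as known (`u` and `v` below)
rather than estimating them from cells. All numeric values are illustrative
assumptions.

```python
import numpy as np
from statistics import NormalDist

# Hypothetical toy setup: five Gaussian wells in two dimensions whose means
# lie along the true bio-rank direction u.
u = np.array([1.0, 1.0]) / np.sqrt(2.0)    # true direction (plays the role of PC1)
v = np.array([1.0, -1.0]) / np.sqrt(2.0)   # orthogonal direction (PC2)
means = np.array([0.5 * k * u for k in range(5)])  # wells in bio-rank order

sd_u = 3.0                                  # spread along u, same in every well
sd_v = np.array([2.0, 0.2, 1.5, 0.3, 1.0])  # spread along v, differs by well

z = NormalDist().inv_cdf(0.99)  # population 99% quantile of a standard normal

# Feature-wise summaries: the quantile of x1 and of x2 taken separately.
# With covariance sd_u^2 * u u' + sd_v^2 * v v', Var(x1) = Var(x2) = (sd_u^2 + sd_v^2) / 2.
sd_x = np.sqrt((sd_u**2 + sd_v**2) / 2.0)
feature_summary = means + z * sd_x[:, None]            # shape (5, 2)

# PC-based summaries: quantiles of the PC1 and PC2 scores, mapped back to (x1, x2).
pc_scores = np.column_stack([means @ u + z * sd_u, means @ v + z * sd_v])
pc_summary = pc_scores @ np.vstack([u, v])             # shape (5, 2)

# Order of the wells along the true direction after each summarization.
feature_order = np.argsort(feature_summary @ u)
pc_order = np.argsort(pc_summary @ u)
print("feature-wise order:", feature_order)
print("PC-based order:    ", pc_order)
```

Under these assumed values the feature-wise summaries reorder the wells along
the true direction, while the PC-based summaries keep the original order,
because the spread along `u` is the same in every well.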

3.2. Cell Summarization. This section extends the toy example into a 
more general case, and compares the cell-well union analysis with the wells- 
alone analysis by quantitatively studying how well the bio-pattern in cell 
data is preserved after cell summarization. Particularly, given that the cell 
feature means across wells reveal the underlying bio-rank, the variability of 
the bio-directional coefficient (defined later) of the summarized data pro- 
vides a measurement of how well the bio-pattern is preserved. The main 
conclusions about the choice of data objects are in Remark 3.1. 

The following notation is used throughout this section. Consider $n$ wells
and their $d_0$ dimensional cell features $X = (x_1, x_2, \ldots, x_{d_0})$.
Let $Y = (y_1, y_2, \ldots, y_{d_0})$ be a set of orthogonal linear
combinations of the cell features, and the statistics of $Y$ across wells are
taken as cell summaries. Note that $X$ is a special case of $Y$. In
wells-alone analysis, $Y = X$. In cell-well union analysis, $Y$ can be, for
example, the PC scores of $X$. Assume that the direction of the bio-rank
exists and can be revealed by the cell feature means
$\mu = (\mu_1, \mu_2, \ldots, \mu_{d_0})$.

First, for simplicity, consider one dimensional cell summarization, with each
cell feature summarized by a single statistic. Let
$\tilde{Y} = (\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_{d_0})$ be the
summarized cell features. Denote the true direction as
$(\alpha_1, \alpha_2, \ldots, \alpha_{d_0})$, where
$\sum_{i=1}^{d_0} \alpha_i^2 = 1$. The projection of the population mean of a
well on this direction, $\sum_{i=1}^{d_0} \alpha_i \mu_i$, reveals the
bio-rank. Thus we assume a linear relationship between $\mu$ and the bio-rank.

In order to measure how the bio-pattern is changed after cell summarization,
for each well, we study the projection coefficient of the summarized point
$\tilde{Y}$ onto the true direction centered at the population mean $\mu$,
referred to later as the bio-directional coefficient. This coefficient
$\psi(\tilde{Y})$ is illustrated as the purple line in Figure 7. Simple
calculations show that



(3.2.1) $\psi(\tilde{Y}) = \sum_{i=1}^{d_0} (\tilde{y}_i - \mu_i)\,\alpha_i$




Fig 7. The bio-directional coefficient $\psi(\tilde{Y})$ of the summary
$\tilde{Y}$ (blue point) is shown as the purple line in a two dimensional
case. The red point is the population mean $\mu$.



If the bio-directional coefficients of the wells are constant, then the
bio-pattern is very well preserved after cell summarization. On the other
hand, if these coefficients vary considerably, the bio-pattern is greatly
impaired, or even lost, after cell summarization. Thus, the uncertainty of the
preserved bio-pattern in the summarized cell data can be quantitatively
expressed by the variability of these coefficients, defined as follows.

Definition 3.1. Consider a one dimensional cell summarization. Let
$\psi(\tilde{Y})$ be the bio-directional coefficient of the summarized data
$\tilde{Y}$, defined in (3.2.1). Then the uncertainty of the bio-pattern in
the summarized data, denoted as $\eta(\tilde{Y})$, is defined as the variance
of $\psi(\tilde{Y})$, i.e. $\mathrm{Var}_w(\psi(\tilde{Y}))$, where the
subscript $w$ highlights that it is a well-level variance.
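
Definition 3.1 can be illustrated with a few lines of Python. This is a
minimal sketch, not code from the study: the function names, the direction
`alpha`, and the well means and summaries are hypothetical, and the variance
is taken with divisor $n$ (NumPy's default), which the definition does not
specify.

```python
import numpy as np

def bio_directional_coefficient(summary, mu, alpha):
    """Projection coefficient of (summary - mu) onto the unit direction alpha."""
    diff = np.asarray(summary, dtype=float) - np.asarray(mu, dtype=float)
    return diff @ np.asarray(alpha, dtype=float)

def uncertainty(summaries, mus, alpha):
    """Well-level variance of the coefficients; one row per well (divisor n)."""
    return float(np.var(bio_directional_coefficient(summaries, mus, alpha)))

alpha = np.array([0.6, 0.8])            # hypothetical unit-length true direction
mus = np.outer(np.arange(3), alpha)     # hypothetical well means along alpha

# Same offset in every well: the coefficients are constant.
constant = mus + np.array([1.0, 0.5])
# Offsets that differ by well: the coefficients vary.
varying = mus + np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print(uncertainty(constant, mus, alpha))
print(uncertainty(varying, mus, alpha))
```

The first call returns a value that is zero up to rounding, matching the case
where the bio-pattern is fully preserved; the second is strictly positive.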



Then, assuming a Gaussian distribution of the cell 
features, we have the following lemma (see appendix for proofs). 

Lemma 3.1. Consider $d_0$ dimensional cell data $X$. The summarized data
$\tilde{Y} = (\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_{d_0})$ is derived
by taking a single quantile of a collection of orthogonal linear combinations
$Y = (y_1, y_2, \ldots, y_{d_0})$ of the cell features. Assume the cell data
distributions of the wells are independent and Gaussian, with population mean
$\mu = (\mu_1, \mu_2, \ldots, \mu_{d_0})$. Assume the bio-rank direction
$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{d_0})$ exists, where
$\sum_{i=1}^{d_0} \alpha_i^2 = 1$, and is determined by $\mu$. Note that the
$\alpha_i$'s depend on the coordinate system defined by $Y$, i.e.
$\alpha = \alpha(Y)$. Then the uncertainty of the bio-pattern after cell
summarization is

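(3.2.2) $\eta(\tilde{Y}) = c_q^2 \langle \alpha^2(Y), \mathrm{Var}_w\,\mathrm{Sd}_c(Y) \rangle$,

where $c_q$ is determined by the choice of the quantile,
$\langle \cdot, \cdot \rangle$ denotes inner product,
$\alpha^2(Y) = (\alpha_1^2, \ldots, \alpha_{d_0}^2)$,
$\mathrm{Var}_w\,\mathrm{Sd}_c(Y) = (\mathrm{Var}_w\,\mathrm{Sd}_c(y_1), \ldots, \mathrm{Var}_w\,\mathrm{Sd}_c(y_{d_0}))$
and the subscripts $w$ and $c$ indicate well-level and cell-level operations
respectively.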

Equation (3.2.2) suggests that the uncertainty is bounded between
$c_q^2 \min_i \{\mathrm{Var}_w\,\mathrm{Sd}_c(y_i)\}$ and
$c_q^2 \max_i \{\mathrm{Var}_w\,\mathrm{Sd}_c(y_i)\}$, regardless of the cell
data dimension $d_0$. Any cell summaries that lead to a smaller uncertainty
will be considered better. The uncertainty $\eta$ depends on the following
three aspects.

(1) The choice of the statistic, which is reflected by the term $c_q$ in the
equation. Under the assumptions in the lemma, cell feature medians lead to a
small $c_q$, and are the optimal choice. This is because the population mean
of a well is assumed to determine its bio-rank. In practice, before cell
summarization, it is always good to perform an exploratory analysis of the
densities of the $y_i$'s to choose a suitable quantile which nicely reflects
the bio-pattern. Hence an additional cell-level analysis is always preferred.
However, if $d_0$ is large, choosing proper quantiles for each $y_i$ is not
feasible.

(2) Well-level variability of the cell-level standard deviation, i.e.
$\mathrm{Var}_w\,\mathrm{Sd}_c(Y)$. If the distributions of the cell-level
data are consistent across wells, this term is 0. That is, whatever cell
summaries are used, the uncertainty of the bio-pattern is 0. Thus both the
wells-alone analysis and the cell-well union analysis give good estimates of
the bio-rank. Under Gaussian assumptions, standardizing the cell-level data
$Y$ across wells by their standard deviations can reduce this term.

(3) The choice of the orthogonal linear combinations $y_i$, which is reflected
by the term $\alpha^2(Y)$. The inner product suggests that one should consider
$\alpha(Y)$ and $\mathrm{Var}_w\,\mathrm{Sd}_c(Y)$ together. In the case of
wells-alone analysis, $Y = X$, the inner product is
$\langle \alpha^2(X), \mathrm{Var}_w\,\mathrm{Sd}_c(X) \rangle$, which is
determined by the structure of the original cell data. In the cell-well union
analysis, the cell-level analyses, such as PCA or Partial Least Squares (PLS,
taking the bio-rank as the response), can possibly construct a $Y$ that gives
a smaller value of this inner product (see Section 3.3 for simulation
results). In particular, if $y_1$ captures the true direction, i.e. $y_1$ is
the cell data projection on the true direction of the bio-rank, then
$\alpha_1 = 1$ and $\alpha_k = 0$ for $k \neq 1$, thus the inner product is
$\mathrm{Var}_w\,\mathrm{Sd}_c(y_1)$. Assuming this true direction is reliable
for estimating bio-rank in the sense that the cell data projections on it do
not vary much across wells, this inner product can be very small. That is, the
additional cell-level analysis can construct a better $Y$ to reduce the
uncertainty of the bio-pattern after cell summarization. Thus the cell-well
union analysis can be better than the wells-alone analysis. Note that there
should be no dimension reduction in $Y$ (e.g. all the PCs should be included
when using PC summaries), because this may lose useful cell information for
further well-level analysis.
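
The three aspects above can be made concrete by evaluating the right-hand side
of (3.2.2) directly. The sketch below is illustrative only: it takes $c_q$ to
be the standard normal quantile, which is what the Gaussian assumption implies
at the population level, and the direction vectors and variance values are
hypothetical numbers chosen to show the effect of rotating the coordinate
system, not estimates from the image data.

```python
import numpy as np
from statistics import NormalDist

def uncertainty_formula(q, alpha, var_w_sd):
    """Evaluate c_q^2 * <alpha^2, Var_w Sd_c(Y)> for a single quantile level q."""
    alpha = np.asarray(alpha, dtype=float)
    var_w_sd = np.asarray(var_w_sd, dtype=float)
    if not np.isclose(np.sum(alpha**2), 1.0):
        raise ValueError("alpha must have unit length")
    c_q = NormalDist().inv_cdf(q)   # standard normal quantile (assumed form of c_q)
    return c_q**2 * float(np.dot(alpha**2, var_w_sd))

# Wells-alone (Y = X): the true direction is spread over all features.
alpha_x = np.ones(3) / np.sqrt(3.0)
var_x = np.array([4.0, 5.0, 6.0])      # hypothetical Var_w Sd_c(x_i)

# Cell-well union: y_1 is aligned with the true direction and is stable across wells.
alpha_y = np.array([1.0, 0.0, 0.0])
var_y = np.array([0.5, 7.0, 7.5])      # hypothetical Var_w Sd_c(y_i)

eta_x = uncertainty_formula(0.75, alpha_x, var_x)
eta_y = uncertainty_formula(0.75, alpha_y, var_y)
eta_median = uncertainty_formula(0.5, alpha_x, var_x)
print(eta_x, eta_y, eta_median)
```

With these assumed inputs the rotated coordinates give the smaller value, the
median gives a value of zero up to rounding, and each result lies between
$c_q^2$ times the smallest and largest entries of the variance vector.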

The above discussions can be easily generalized to multi-dimensional cell 
summarization. It is straightforward to extend Lemma 3.1 to the following 
proposition. The main lesson learned from the one dimensional cell summa- 
rization still holds. 

Proposition 3.1. Consider a $d_s$ dimensional cell summarization, that is,
the cell-level data $Y$ are summarized by $d_s$ quantiles across wells. The
dimension of the summarized data $\tilde{Y}$ is $d_s d_0$. Let
$\tilde{y}_{ij}$ be the $j$-th quantile of the original cell-level feature
$y_i$, for $i = 1, \ldots, d_0$ and $j = 1, \ldots, d_s$. Suppose
$\tilde{Y} = (\tilde{y}_{11}, \ldots, \tilde{y}_{1 d_s}, \ldots, \tilde{y}_{d_0 1}, \ldots, \tilde{y}_{d_0 d_s})$.
Suppose the true direction in the summarized data space is of the form
$(\alpha_{11}, \ldots, \alpha_{1 d_s}, \ldots, \alpha_{d_0 1}, \ldots, \alpha_{d_0 d_s})$
where $\sum_{i=1}^{d_0} \sum_{j=1}^{d_s} \alpha_{ij}^2 = 1$. Under the same
assumptions and notation as Lemma 3.1, the uncertainty of the bio-pattern
after cell summarization is

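(3.2.3) $\eta(\tilde{Y}) = \sum_{j=1}^{d_s} c_{q(j)}^2 \langle \alpha_{(\cdot j)}^2, \mathrm{Var}_w\,\mathrm{Sd}_c(Y) \rangle$,

where $c_{q(j)}$ is determined by the choice of the $j$-th quantile, and
$\alpha_{(\cdot j)} = (\alpha_{1j}, \ldots, \alpha_{d_0 j})$.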

It is seen that the uncertainty $\eta(\tilde{Y})$ is bounded, regardless of
$d_0$ and $d_s$, and its value also depends on the same three aspects as
discussed previously in the case of one dimensional cell summarization.

As a conclusion, the cell-well union analyses are generally better than the 
wells-alone analyses, as stated in the following remark. 

Remark 3.1. Consider analyzing cell-well structured data. Assume the
direction of the bio-rank exists. In the process of summarizing the cell
features, an additional cell-level analysis beforehand can help pass the
underlying bio-pattern in the cell data to further well-level analysis more
consistently. Thus the cell-well union analysis estimates the bio-rank better
than the wells-alone analysis, or the cell-well unions are a better choice of
data objects than the wells alone. The cases where the two types of data
objects can be equally good are (1) the cell feature distributions across
wells are consistent; (2) one summarizing statistic is in the order of the
bio-rank.

Additionally, the previous discussion of Lemma 3.1 and Equation (3.2.2)
suggests additional approaches to cell-level analyses.

(1) Standardize the cell-level data of each well by their standard devia- 
tions, if the cell features are normally distributed; 

(2) Choose statistics of the cell features that reflect the bio-rank, if feasible; 

(3) Find a direction that nicely reveals the bio-rank (PCA and PLS are 
two recommended tools), and then summarize the cell features based 
on data projections in this direction and all its orthogonal directions. 
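
Approaches (1) and (3) can be sketched in Python as follows. This is an
assumed implementation, not the code used for the image data: the function
names are hypothetical, PCA is computed on all cells pooled across wells,
standardization is taken to mean dividing each well's scores by their
within-well standard deviation without centering, and the demo data are
random numbers. PLS is not covered here.

```python
import numpy as np

def pooled_pca_basis(wells):
    """Orthonormal basis (rows are directions) from PCA of all cells pooled."""
    pooled = np.vstack(wells)
    centered = pooled - pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt  # every component is kept, so there is no dimension reduction

def summarize_wells(wells, basis, quantiles=(0.01, 0.25, 0.5, 0.75, 0.99),
                    standardize=False):
    """Quantile summaries of each well's scores along every direction in basis."""
    rows = []
    for cells in wells:
        scores = cells @ basis.T                  # (n_cells, d0) projections
        if standardize:
            # Assumed reading of "standardize": scale by the within-well SD only.
            scores = scores / scores.std(axis=0, ddof=1)
        # Layout per well: all quantiles of direction 1, then direction 2, and so on.
        rows.append(np.quantile(scores, quantiles, axis=0).T.ravel())
    return np.vstack(rows)                        # (n_wells, d0 * n_quantiles)

# Hypothetical demo data: two wells with 3 features each.
rng = np.random.default_rng(0)
demo_wells = [rng.normal(size=(40, 3)), rng.normal(size=(60, 3))]
demo_basis = pooled_pca_basis(demo_wells)
pc_summaries = summarize_wells(demo_wells, demo_basis)
feature_summaries = summarize_wells(demo_wells, np.eye(3))  # wells-alone style
print(pc_summaries.shape, feature_summaries.shape)
```

Passing the identity matrix as the basis reproduces feature-wise summaries, so
the same function covers both the wells-alone and the cell-well union style of
summarization.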

We investigated these approaches using the image data, including
standardizing the cell data for each well and summarizing cell features using
PLS instead of PCA. The standardization, however, did not improve the results
much, because many cell features are not normally distributed and their
quantiles can never be effectively standardized by the standard deviations.
The PLS led to the same classification error rate as PCA, since the PLS
direction (taking the bio-rank as the response) was very close to PC1. To
further investigate these approaches, Section 3.3 uses simulations.

3.3. Simulations. This section validates the conclusions in Section 3.2 
using simulations. 

We simulated 50 wells, each with 50 to 300 cells, and each cell with 10
features. These cell features were normally distributed. The cell feature
variances across wells were chosen as Uniform(20, 500) random variables. The
cell features of the wells had the same correlation structure (randomly
generated). Cell data of different wells were simulated independently. The
population means of the wells determined their bio-rank. These means were
linearly located, and the difference between the means of two neighboring
wells (with the bio-rank difference being 1) was 0.005. This bio-rank
direction has equal entries, and thus has the same angle with each of the cell
feature axes. The three passaging groups were defined by two thresholds on the
population means. The data were then standardized.
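
A data generator along these lines might look like the Python sketch below. It
is an assumed reading of the description, not the original simulation code:
the way the random correlation matrix is built, the interpretation of the
0.005 spacing as a per-coordinate step, and the function name
`simulate_wells` are all hypothetical. The passaging-group thresholds, the
final standardization and the DWD classifier are not reproduced.

```python
import numpy as np

def simulate_wells(n_wells=50, n_features=10, step=0.005, seed=0):
    """Simulate Gaussian cell data per well; returns (list of arrays, well means)."""
    rng = np.random.default_rng(seed)

    # One random correlation matrix shared by all wells (construction assumed).
    a = rng.standard_normal((n_features, n_features))
    cov = a @ a.T
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)

    wells, means = [], []
    for rank in range(n_wells):
        # Equal entries put the means on a line along the bio-rank direction.
        mu = np.full(n_features, rank * step)
        # Feature variances drawn separately for each well.
        sd = np.sqrt(rng.uniform(20.0, 500.0, size=n_features))
        sigma = corr * np.outer(sd, sd)
        n_cells = int(rng.integers(50, 301))      # 50 to 300 cells, inclusive
        wells.append(rng.multivariate_normal(mu, sigma, size=n_cells))
        means.append(mu)
    return wells, np.array(means)

sim_wells, sim_means = simulate_wells()
print(len(sim_wells), sim_wells[0].shape, sim_means.shape)
```

Because the spacing of the means is tiny relative to the variances, the
bio-rank signal in any single summary is weak, which is the regime in which
the choice of cell summarization matters most.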

We performed both wells-alone analyses and cell-well union analyses to
estimate the passaging groups. Five quantiles, 1%, 25%, 50%, 75% and 99%, were
used to summarize the cell-level data, and DWD was used to classify the
groups. The wells-alone analyses were performed on two different cell-level
data sets separately: the original cell data, and the cell data standardized
within each well. Two different cell summaries were considered in the
cell-well union analyses: the PC summaries, and the PLS and its orthogonal PC
summaries.

Table 3 shows the results of 500 simulations. It is seen that additional cell
analyses, such as PCA or PLS, reduce the classification error rate. Thus the
cell-well union analyses are better than the wells-alone analyses. Also, lower
uncertainty values lead to lower classification error rates, which is
consistent with the discussions in Section 3.2. Comparing the two wells-alone
analyses suggests that standardizing cell data within each well reduces the
classification error rate.

The simulations confirm that OODA can lead to a better choice of data 
objects, i.e. the cell-well unions, which leads to significantly better results 
than those from the wells-alone analyses. 



Table 3
Simulation results.

Data Objects         Wells-Alone    Wells-Alone    Cell-Well Unions   Cell-Well Unions
Cell Analyses        Not done       Std [1]        PCA & Std [1]      PLS & Std [1]
Uncertainty [2]      1.414±0.051    1.390±0.055    0.471±0.088        0.464±0.078
DWD Error Rate [2]   0.212±0.011    0.132±0.009    0.105±0.009        0.104±0.009

[1] Standardize the cell data (or the PC/PLS scores) for each well by their standard deviations.
[2] The 95% confidence intervals from 500 simulations are shown.



4. Conclusions. This paper has proposed a new type of data object, the
cell-well union, for the analysis of cell-well structured data, motivated by a
study of cell images. We carefully discussed the choice of different data
objects and compared their performances. The comparison suggests that the
cell-well unions are a better choice of data objects than either the wells
alone or the cells alone. This paper clearly shows how the choice of data
objects orients further analyses. Beyond being a framework for understanding
the structure of the data analysis, OODA, as effective terminology for
interdisciplinary communication, can guide critical choices of data objects,
which can lead to better analyses of complex data.

APPENDIX

Proof of Lemma 3.1

\begin{align*}
\eta(\tilde{Y}) &= \mathrm{Var}_w\left( \sum_{i=1}^{d_0} (\tilde{y}_i - \mu_i)\,\alpha_i \right) && \text{(Equation 3.2.1)} \\
&= c_q^2 \sum_{i=1}^{d_0} \alpha_i^2\, \mathrm{Var}_w\,\mathrm{Sd}_c(y_i) && \text{(Under Gaussian assumption, } \tilde{y}_i - \mu_i = c_q\,\mathrm{Sd}_c(y_i)\text{)}
\end{align*}